Now let's look at another
behavior of the mixture model, and
in this case let's look at
the response to data frequencies.
So what you are seeing now is basically
the likelihood function for
the two-word document, and
we know that in this case the solution is
"text" with a probability of 0.9 and
"the" with a probability of 0.1.
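For reference, the two-word likelihood can be written out as a short worked equation. This is only a sketch under assumptions read off the numbers in this lecture: the document is d = "text the", the background model has p(the | θ_B) = 0.9 and (assumed, so that the two-word vocabulary sums to one) p(text | θ_B) = 0.1, and both components are chosen with probability 0.5. The symbol x for p(text | θ_d) is introduced here just for illustration.

```latex
% Assumed setup: d = "text the", p(the|\theta_B) = 0.9, p(text|\theta_B) = 0.1,
% p(\theta_d) = p(\theta_B) = 0.5, x = p(text|\theta_d), so p(the|\theta_d) = 1 - x.
\begin{align*}
p(d) &= \bigl[0.5\,x + 0.5 \cdot 0.1\bigr]\,\bigl[0.5\,(1 - x) + 0.5 \cdot 0.9\bigr] \\
     &= 0.25\,(x + 0.1)\,(1.9 - x).
\end{align*}
% Setting the derivative of (x + 0.1)(1.9 - x) with respect to x to zero:
\begin{align*}
(1.9 - x) - (x + 0.1) = 0 \;\Rightarrow\; x = 0.9, \qquad 1 - x = 0.1 .
\end{align*}
```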
Now it's interesting to
think about a scenario where we start
adding more words to the document.
So what would happen if we add
many the's to the document?
Now this would change the game, right?
So, how?
Well, picture, what would
the likelihood function look like now?
Well, it starts with the likelihood
function for the two words, right?
As we add more words, we know that
we just have to multiply
the likelihood function by
additional terms to account for
the additional
occurrences of "the".
Since in this case
all the additional terms are "the",
we're going to just multiply by this term,
right,
for the probability of "the".
And if we have another occurrence of "the",
we'd multiply again by the same term,
and so on and so forth,
adding as many terms as the number of
"the"s that we add to the document, d'.
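Written out, the extended likelihood might look like the following. This is a sketch, assuming the original document is "text the", the background probabilities are 0.9 for "the" and 0.1 for "text", and each component is chosen with probability 0.5. The symbols x (for p(text | θ_d)) and n (for the number of added occurrences of "the") are introduced here only for illustration.

```latex
% Assumed: x = p(text|\theta_d), 1 - x = p(the|\theta_d),
% n = number of additional occurrences of "the" in d'.
\begin{align*}
p(d') = \bigl[0.5\,x + 0.5 \cdot 0.1\bigr]\,
        \bigl[0.5\,(1 - x) + 0.5 \cdot 0.9\bigr]^{\,1 + n}
\end{align*}
% Each extra "the" contributes one more copy of the second factor.
```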
Now this obviously changes
the likelihood function.
So what's interesting now is to think
about how that would change our solution.
So what's the optimal solution now?
Now, intuitively you'd know
the original solution,
0.9 versus 0.1, will no
longer be optimal for this new function.
Right?
But the question is how
we should change it.
In general, the probabilities must sum to one.
So we know we must take away some
probability mass from one word and
add the probability
mass to the other word.
The question is which word should
have its probability reduced and
which word should have a larger probability.
And in particular,
let's think about the probability of "the".
Should it be increased
to be more than 0.1?
Or should we decrease it to less than 0.1?
What do you think?
Now you might want to pause the video
for a moment to think more about
this question,
because this has to do with understanding
an important behavior of a mixture model
and, indeed, of
any maximum likelihood estimator.
Now if you look at the formula for
a moment, then you will see that
the objective function is now more
influenced by "the" than by "text".
Before, each contributed one term.
So now, as you can imagine,
it would make sense to actually
assign a smaller probability to
"text" in order
to make room for
a larger probability for "the".
Why?
Because "the" is repeated many times.
If we increase its probability a little bit,
it will have more positive impact,
whereas a slight decrease for "text"
will have relatively small impact
because it occurred just once, right?
So this means there is another
behavior that we observe here.
That is, high-frequency words
tend to get high probabilities
from all the distributions.
And, this is no surprise at all,
because after all, we are maximizing
the likelihood of the data.
So the more a word occurs, the more
sense it makes to give such a word
a higher probability, because the impact
on the likelihood function would be larger.
This is in fact a very general phenomenon
of all maximum likelihood estimators.
But in this case, we can see that as we
see more occurrences of a term,
it also encourages the unknown
distribution theta sub d
to assign a somewhat higher
probability to this word.
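The effect of adding occurrences of "the" can be checked numerically. The sketch below is an illustration, not the lecture's own code: it assumes a two-word vocabulary, a background model with p(the) = 0.9 and p(text) = 0.1, a 0.5 probability of choosing the background, and a document with one "text" plus a varying number of "the"s. The names `log_likelihood` and `best_p_the` are hypothetical, and a simple grid search stands in for a proper optimizer.

```python
import math

# Assumed setup (not given as code in the lecture):
P_THE_BG, P_TEXT_BG = 0.9, 0.1  # background word distribution
LAMBDA_BG = 0.5                 # probability of choosing the background


def log_likelihood(p_the, n_the, n_text=1):
    """Log-likelihood of a document with n_text 'text' and n_the 'the'."""
    p_text = 1.0 - p_the  # the two topic probabilities must sum to one
    term_the = (1 - LAMBDA_BG) * p_the + LAMBDA_BG * P_THE_BG
    term_text = (1 - LAMBDA_BG) * p_text + LAMBDA_BG * P_TEXT_BG
    return n_the * math.log(term_the) + n_text * math.log(term_text)


def best_p_the(n_the, steps=1000):
    """Grid search for the p(the | theta_d) that maximizes the likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: log_likelihood(p, n_the))


# Print the maximizing p(the | theta_d) as "the" occurs more often.
for n_the in (1, 2, 5, 10):
    print(n_the, best_p_the(n_the))
```

Under these assumptions the optimum starts at 0.1 for a single "the" and moves upward as more occurrences are added, which matches the intuition described above.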
Now it's also interesting to think about
the impact of the probability of theta sub B,
the probability of choosing one
of the two component models.
Now we've been so far assuming
that each model is equally likely,
and that gives us 0.5.
But you can again look at this likelihood
function and try to picture what would
happen if we increase the probability
of choosing the background model.
Now you will see that these terms for "the"
have a different form, where
the probability would be
even larger, because the background has
a high probability for the word and
the coefficient in front of 0.9, which
is now 0.5, would be even larger.
When this is larger,
the overall result would be larger.
And that also makes it
less important for
theta sub d to increase
the probability of "the",
because the term is already very large.
So the impact here of increasing
the probability of "the" is somewhat
regulated by this coefficient,
the probability of theta sub B.
If it's larger for the background,
then it becomes less important
to increase the value.
So this means the behavior here,
which is that high-frequency words tend to get
high probabilities, is affected or
regularized somewhat by the probability
of choosing each component.
The more likely a component
is to be chosen,
the more important it is for it to have higher
values for these frequent words.
If a component has a very small probability of
being chosen, then the incentive is less.
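To see how the probability of choosing the background regulates this effect, one can repeat the maximization for several values of that probability. The sketch below assumes a background model with p(the) = 0.9 and p(text) = 0.1 and a toy document with one "text" and three "the"s; the function name `best_p_the` and the parameter `lambda_bg` are hypothetical, and the grid search is only a stand-in for a real optimizer.

```python
import math

# Assumed background word distribution (two-word vocabulary).
P_THE_BG, P_TEXT_BG = 0.9, 0.1


def best_p_the(lambda_bg, n_the=3, n_text=1, steps=1000):
    """Return the p(the | theta_d) maximizing the likelihood for a given
    probability lambda_bg of choosing the background (0 < lambda_bg < 1)."""

    def log_likelihood(p_the):
        term_the = (1 - lambda_bg) * p_the + lambda_bg * P_THE_BG
        term_text = (1 - lambda_bg) * (1 - p_the) + lambda_bg * P_TEXT_BG
        return n_the * math.log(term_the) + n_text * math.log(term_text)

    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=log_likelihood)


# Print the maximizing p(the | theta_d) for increasing background weight.
for lambda_bg in (0.2, 0.5, 0.8):
    print(lambda_bg, best_p_the(lambda_bg))
```

With these assumed numbers, the optimal p(the | θ_d) gets smaller as the background is chosen more often, since the background already explains the frequent word.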
So to summarize,
we have just discussed the mixture model.
We discussed the estimation
problem of the mixture model, and
in particular we discussed some
general behavior of the estimator,
which means we can expect our
estimator to capture these intuitions.
First, every component model
attempts to assign high probabilities to
high-frequency words in the data.
And this is to collaboratively
maximize the likelihood.
Second, different component models tend to
bet high probabilities on different words.
And this is to avoid competition or
a waste of probability.
And this would allow them to collaborate
more efficiently to maximize
the likelihood.
So, the probability of choosing each
component regulates the collaboration and
the competition between component models.
It would allow some component models
to respond more to a change,
for example, in the frequency of
a data point in the data.
We also talked about the special case
of fixing one component to a background
word distribution, right?
And this distribution can be estimated
by using a collection of documents,
a large collection of English documents,
by using just one distribution and
then we'll just have normalized
frequencies of terms to
give us the probabilities
of all these words.
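Estimating the background distribution by normalized term frequencies can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the function name `estimate_background` and the toy collection are made up for the example, and a real collection would be far larger.

```python
from collections import Counter


def estimate_background(documents):
    """Estimate one word distribution from a whole collection by
    normalizing term frequencies."""
    counts = Counter()
    for doc in documents:
        # Assumption: simple lowercasing and whitespace tokenization.
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy collection, standing in for a large English corpus.
toy_collection = [
    "the text mining course",
    "the model of the text",
    "the the data",
]
background = estimate_background(toy_collection)
print(background["the"])  # relative frequency of "the" in the collection
```

Common words such as "the" end up with the largest probabilities, which is what lets the fixed background component absorb them.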
Now when we use such
a specialized mixture model,
we showed that we can effectively get rid
of background words in the other component.
And that would make the discovered
topic more discriminative.
This is also an example of imposing
a prior on the model parameters, and
the prior here basically means one model
must be exactly the same as the background
language model. If you recall what we
talked about in Bayesian estimation,
this prior will allow us to favor a model
that is consistent with our prior.
In fact, if it's not consistent, we're
going to say the model is impossible,
so it has a zero prior probability.
That effectively excludes such a scenario.
This is also an issue that
we'll talk more about later.

